Machine Learning in the
audio domain


When the neural network is overkill or where are the 
limits of lightweight models 
Audio domain


Analyzing sounds using 
neural networks 


* Call centers
* Virtual assistants
* Speech and music generation


Pic source 
https://www.linkedin.com/pulse/turing-test-music-why-spotify-flooded-songs-mathieu-govaarts/ 


You are just a machine, an 
imitation of life. 


Can a robot write a symphony? 
create a beautiful 
masterpiece? 


Problem Statement 


Data modality and SotA


When working with media 
data, we usually use large 
neural networks. 


But: 
* They are resource-heavy
* They are slower than lightweight models


Usually these are Transformer models


Pic source https://comicvine.gamespot.com/ 


Time for SotA 


It takes a LONG time to train a SotA model for the
task we'll discuss today


Train
* 15h for the first epoch, 40k audio files
* 7h to read and prepare audio files

Inference
* 1.3s to process one file (including reading)


GPU: RTX 3090Ti, 24GB 


Audio-domain tasks 


Understanding and Generation 


Understanding (Classification) 


* Classification, e.g. emotion classification
* "Token classification", e.g. voice activity detection
* Speaker separation, user verification
* Automatic Speech Recognition (ASR)
Pic source https://kairosgame.com/


Generation 


* Voice cloning
* Text to speech
* Speech and music generation


Pic source https://twitter.com/ai__pub https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0 
https://www.holdcom.com/support/telephony-audio-file-formats/ 


Experiments & Business 


Classification 


Evaluation of the call quality for the call center: 


Stable connection or unstable connection?


Regression 


Interviewing a candidate for an English Teacher position 


Score scale 100 / 50 / 0: candidate accepted / human verification / candidate declined

Fluency and pronunciation are evaluated separately on a 5-point scale


Gradient Boosting on Decision
Trees (GB on DT)


[Figure: single decision tree vs. gradient boosted trees]

* Fast
* Works on statistical aggregates of audio
* But is it accurate?
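To make the difference between a single tree and boosted trees concrete, here is a minimal sketch of gradient boosting for squared loss on one feature. It is an illustration only, not the library used in the talk: the function names (`fit_stump`, `fit_boosting`, `predict`), the depth-1 trees, and the default `n_trees` and `learning_rate` are assumptions, and the input needs at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Best single-split regression tree (depth 1) under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - l_mean) ** 2 for r in left)
                 + sum((r - r_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, l_mean, r_mean)
    return best[1:]


def fit_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, l_mean, r_mean = fit_stump(xs, residuals)
        stumps.append((threshold, l_mean, r_mean))
        preds = [p + learning_rate * (l_mean if x <= threshold else r_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (l if x <= t else r) for t, l, r in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Each added tree only corrects what the previous ones got wrong, which is why many small trees can stay cheap to train and evaluate while still fitting non-trivial targets.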


Wav2Vec2 


* SotA for speech recognition
* Transformer model
* CNN layers encode local features
* But is it fast?


[Figure: Wav2Vec2 architecture: raw waveform, latent speech representations, quantized representations, masked Transformer, context representations, contrastive loss]

Training Process 


Preprocessing 


For GB on DT
* Collect amplitude and melspectrogram statistics
* Mean, median, min, max, std, kurtosis, skew... over the melspectrogram for every frequency and over the raw wav

For Wav2Vec2
* Resample to 16kHz
* Just truncation
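The GB on DT features can be sketched as plain statistical aggregates. This is an assumption-laden illustration, not the pipeline from the talk: `aggregate_stats` and `build_feature_vector` are hypothetical names, `mel_bands` is assumed to be a list with one sequence of values over time per mel frequency band, and kurtosis is reported as excess kurtosis.

```python
import math
import statistics


def aggregate_stats(values):
    """Summarise one sequence of numbers (raw samples or one mel band)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population central moments; skew and kurtosis are derived from them.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std": std,
        "skew": m3 / std ** 3 if std > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,  # excess kurtosis
    }


def build_feature_vector(waveform, mel_bands):
    """Flatten statistics of the raw waveform and of every mel band into one dict."""
    features = {f"wav_{k}": v for k, v in aggregate_stats(waveform).items()}
    for i, band in enumerate(mel_bands):
        for k, v in aggregate_stats(band).items():
            features[f"mel{i}_{k}"] = v
    return features
```

The result is a fixed-length feature vector regardless of audio duration, which is what lets a tree model consume clips of different lengths.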


What is a melspectrogram?


[Figure: raw audio next to its melspectrogram]
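A melspectrogram is a spectrogram whose frequency axis is warped to the mel scale, which spaces low frequencies more finely than high ones, roughly as human hearing does. The sketch below shows only that warping, using one common (HTK-style) formula. The slides do not say which formula or how many bands were used, so the function names and the example arguments are assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min_hz, f_max_hz):
    """Band edges in Hz that are equally spaced on the mel axis."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Example: edges for 4 bands between 0 Hz and 8 kHz (half of a 16 kHz sample rate).
edges = mel_band_edges(4, 0.0, 8000.0)
```

Because the edges are equally spaced in mels, the bands get wider in Hz as frequency grows.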


Training 


For GB on DT
* CatBoost
* Small trees (depth 6)
* 100-10000 iterations

For Wav2Vec2
* Wav2Vec2 base
* Freeze feature encoder
* Learning rate schedulers
* 10 epochs
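The slides mention learning rate schedulers without naming one. As an illustration, here is a sketch of a common choice, linear warmup followed by linear decay, written as a plain function. The scheduler type, the 10% warmup share and the peak learning rate are all assumptions; only the 10 epochs, the roughly 5k files and the batch size of 6 come from the classification run in the talk.

```python
import math


def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp-up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# From the slides: about 5k audio files, batch size 6, 10 epochs.
steps_per_epoch = math.ceil(5000 / 6)
total_steps = 10 * steps_per_epoch

# Assumptions, not from the slides: 10% warmup and a peak learning rate of 1e-4.
warmup_steps = total_steps // 10
peak_lr = 1e-4

schedule = [
    linear_warmup_decay(s, total_steps, warmup_steps, peak_lr)
    for s in range(total_steps + 1)
]
```

With a batch size of 6 the run takes several thousand optimizer steps, so a warmup phase covers a meaningful number of updates before the rate starts to decay.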


Time metrics. Classification 


For GB on DT
* Train: 45 min to read data
* 0.55 to train on GPU
* 5k audio files
* Inference: 5 files, 2s to read data, 0.1s for inference on CPU

For Wav2Vec2
* Train: 45 min to read data
* 5h to train on GPU (10 epochs)
* 5k audio files, batch size 6
* Inference: 5 files, 2s to read data, 105s for inference on CPU


Time metrics. Regression 


For GB on DT
* Train: 7h to read data
* 5 min to train on GPU
* 40k audio files
* Inference: 5 files, 2s to read data, 0.1s for inference on CPU

For Wav2Vec2
* Train: 7h to read data
* 90h to train on GPU (10 epochs)
* 40k audio files, batch size 6
* Inference: 5 files, 2s to read data, 105s for inference on CPU
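The gap between the two models is easier to see as ratios. The sketch below uses only numbers from the time-metrics slides. The per-file figure assumes that the 105s covers all 5 files, which the slide suggests but does not state outright, and the variable names are illustrative.

```python
def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast option is."""
    return slow_seconds / fast_seconds


HOUR, MINUTE = 3600.0, 60.0
N_FILES = 5  # files per inference run on the slides

# Inference on CPU, model time only (the 2s of data reading is the same for both).
gb_infer_s, w2v_infer_s = 0.1, 105.0
inference_speedup = speedup(w2v_infer_s, gb_infer_s)
w2v_seconds_per_file = w2v_infer_s / N_FILES

# Regression training on GPU: 5 min vs 90h (the 7h of data reading is shared).
train_speedup = speedup(90 * HOUR, 5 * MINUTE)

print(f"inference: {inference_speedup:.0f}x faster with GB on DT")
print(f"Wav2Vec2 inference: {w2v_seconds_per_file:.0f}s per file on CPU")
print(f"regression training: {train_speedup:.0f}x faster with GB on DT")
```

Both ratios land around three orders of magnitude, which is the cost side of the time and $ / accuracy trade-off.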


Results 


Results. Classification task 


Natural imbalance! 


Results. Regression task


[Figure: panels "GB on DT" and "Wav2Vec2", axis label "Model's label"]


Business Outcomes 


Outcomes 


Time and $ / accuracy trade-off 


* We chose GB on DT for classification
* We chose Wav2Vec2 for regression


Where are the limits and what is 
overkill? 


Pic source https://medium.com/@devsociety 


Thank you for your 
attention! 


Vote for my talk 


Roman Smirnov